Document Classification with LSA and Pretopology

Authors

Murat Ahat

Sofiane Ben Amor

Marc Bui

Sandra Jhean-Larose

Guy Denhière

Abstract

Latent semantic analysis (LSA) is a computational method that accounts for a major component of language learning and use. In this sense it is a theory of meaning: it applies to, and offers an explanation of, phenomena of meaning in words and passages of words. This gives LSA a strong position in automated document classification, document analysis, and related tasks. Although experiments show that LSA can reach very high accuracy in document classification, its performance depends on factors such as the quality and amount of training documents, the characteristics of the representative vectors, and the composition of the documents to be classified. Pretopology, for its part, has shown its strength in data classification and modeling, and some applications that combine pretopology with visualization for classification have shown promising results. This paper proposes two document classification algorithms based on pretopology and LSA, each suited to different situations, and discusses their results on the DEFT07 contest data. The work also points to the future possibility of integrating visualization, which could support human intervention in the classification process.

RÉSUMÉ. L’Analyse de la Sémantique Latente (LSA) est une méthode de calcul qui permet de rendre compte de l’apprentissage du langage et de son utilisation. Dans ce sens, LSA est une théorie de la signification des mots et groupes de mots (paragraphes, passages, textes) et de leur emploi. Cette propriété permet à LSA d’occuper une position enviable dans la classification automatique de documents, l’analyse de documents, etc. Bien que de nombreuses expériences indiquent que LSA peut atteindre une grande précision dans la classification de documents, ses performances sont tributaires de facteurs tels que la qualité et la quantité de documents utilisés pour l’entraînement, les caractéristiques des vecteurs représentatifs et la composition des documents à classer. De son côté, la prétopologie a montré son efficacité dans les domaines de la classification des données et de la modélisation. De plus, certaines applications ont renforcé la prétopologie en ajoutant la visualisation au domaine de la classification et ont donné des résultats prometteurs. Dans cet article, nous proposons deux algorithmes de classification des documents basés sur LSA et la prétopologie, algorithmes qui sont adaptés à des situations différentes et dont nous discutons les résultats obtenus quand ils sont appliqués aux données du défi DEFT07. Ce travail dessine également les possibilités futures d’intégration de la visualisation, intégration qui pourra contribuer à l’intervention humaine dans les processus de

Similar resources

Influence of domain information on Latent Semantic Analysis of Hindi text

The work presented in this paper evaluates the performance of the Latent Semantic Analysis (LSA) model in capturing word correlations within text by including domain information in the process. The performance of the model is empirically evaluated by classification of Hindi text. The accuracies of classification are compared against plain LSA. An increase of 1.25% classification accuracy is ac...

Capturing the semantic structure of documents using summaries in Supplemented Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a mathematical technique that is used to capture the semantic structure of documents based on correlations among textual elements within them. Summaries of documents contain words that actually contribute towards the concepts of documents. In the present work, summaries are used in LSA along with supplementary information such as document category and domain in...

Document representation with Generalized Latent Semantic Analysis

Methods for dimensionality reduction, notably LSA, have been successfully applied to the information retrieval task and document classification. Recently, corpus-based association measures such as point-wise mutual information have been found to outperform LSA on a variety of tasks. We have developed an algorithmic framework that computes a low-dimensional vector space representation of documen...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large amount of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a suitable way t...

Latent semantic sentence clustering for multi-document summarization

This thesis investigates the applicability of Latent Semantic Analysis (LSA) to sentence clustering for Multi-Document Summarization (MDS). In contrast to more shallow approaches like measuring similarity of sentences by word overlap in a traditional vector space model, LSA takes word usage patterns into account. So far LSA has been successfully applied to different Information Retrieval (IR) t...

Journal:

Studia Informatica Universalis (Stud. Inform. Univ.)

Volume 8

Publication date: 2010
